{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Training Notebook\n",
    "\n",
    "Clean and extract features from raw data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "# Steps\n",
    "\n",
    "1. Split the data into training and test data set\n",
    "1. Clean the data (transform null values)\n",
    "1. Scale necessary attributes (normalization, standardization)\n",
    "1. Save transformed data for model training\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model\n",
    "from sklearn import svm\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# serializing, compressing, and loading the models\n",
    "import joblib\n",
    "from tabulate import tabulate\n",
    "# performance\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "# displaying plots\n",
    "import numpy as np\n",
    "import sys\n",
    "sys.path.append(\"../lib\")\n",
    "\n",
    "from getConfig import *\n",
    "config = getConfig(\"../\")\n",
    "config.cleanup(config.trained_path)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Selection\n",
    "\n",
    "While there are several classifiers available, we show how to train the following classifiers, compare and select one.\n",
    "\n",
    "[Classification Model Comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html?highlight=svm%20svc)\n",
    "\n",
    "1. [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)\n",
    "1. [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)\n",
    "1. [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\n",
    "\n",
    "If you are interested in a more extensive collection, i.e, training other kinds of models and comparing them to pick the best one, please refer\n",
    "\n",
    "https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "### Model Evaluation\n",
    "https://scikit-learn.org/stable/modules/model_evaluation.html\n",
    "\n",
    "In this notebook, we use the following APIs to evaluate the quality of a model’s predictions. For more detailed evaluation and other metrics, please refer to the link above\n",
    "\n",
    "1. Estimator score method (Score)\n",
    "2. Scoring parameter (Cross validation)\n",
    "3. Metric functions (F1 score from Classification metrics)\n",
    "\n",
    "Best practice to save every model you experiment with so you can come back easily to any model.\n",
    "Save both the hyperparameters and trained parameter, as well as the cross-validation scores and predictions.\n",
    "This will allow you to easily compare scores across model types. Use Pickle or joblib libraries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Load training and test data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "with open(config.traintest_path + \"X_train_prepared.csv\") as file_name:\n",
    "    X_train_prepared = np.loadtxt(file_name, delimiter=\",\")\n",
    "\n",
    "with open(config.traintest_path + \"X_train_prepared_m.csv\") as file_name:\n",
    "    X_train_prepared_m = np.loadtxt(file_name, delimiter=\",\")\n",
    "    \n",
    "with open(config.traintest_path + \"y_train.csv\") as file_name:\n",
    "    y_train = np.loadtxt(file_name, delimiter=\",\")\n",
    "    \n",
    "with open(config.traintest_path + \"y_test.csv\") as file_name:\n",
    "    y_test = np.loadtxt(file_name, delimiter=\",\")\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Initialize results and define method to evaluate models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Initialize rows and columns of table to store results\n",
    "\n",
    "results = [[\"Support Vector Classifier\"], \n",
    "          [\"Random Forest Classifier\"], \n",
    "          [\"Logistic Regression\"]]\n",
    "  \n",
    "#define header names\n",
    "col_names = [\"Estimator score method\", \"Scoring parameter\", \"Metric functions\"]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "def eval_model(model):\n",
    "\n",
    "    # ESTIMATOR SCORE METHOD\n",
    "\n",
    "    accuracy = model.score(X_train_prepared,y_train)\n",
    "\n",
    "    # SCORING PARAMETER\n",
    "    cross_validation = cross_val_score(model, X_train_prepared, y_train, cv=3, scoring='recall_macro')\n",
    "\n",
    "    # METRIC FUNCTIONS\n",
    "\n",
    "    y_pred = model.predict(X_train_prepared)\n",
    "    f1score = f1_score(y_train, y_pred)\n",
    "\n",
    "    print ('Accuracy:' + str(accuracy) +'\\nCross validation score: ' + str(cross_validation) + '\\nf1 score: ' + str (f1score))\n",
    "\n",
    "    var = [accuracy, cross_validation, f1score]\n",
    "    return var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Support Vector Classifier (SVC)\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# select the SVM\n",
    "clf = svm.SVC()\n",
    "\n",
    "# fit the model to the data\n",
    "CLF = clf.fit(X_train_prepared, y_train)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy:0.9118198874296435\n",
      "Cross validation score: [0.5 0.5 0.5]\n",
      "f1 score: 0.06622516556291391\n"
     ]
    }
   ],
   "source": [
    "result0 = eval_model(CLF)\n",
    "for item in result0:\n",
    "    results[0].append(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Store Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../experiments/experiment_0/models/trained/svc_model.pkl']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# save and compress the model\n",
    "\n",
    "joblib.dump(CLF, config.trained_path + \"svc_model.pkl\", compress=('bz2', 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Forest Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier()"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select the estimator\n",
    "CLF = RandomForestClassifier()\n",
    "\n",
    "# fit the model to the data\n",
    "CLF.fit(X_train_prepared, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy:1.0\n",
      "Cross validation score: [0.73339777 0.70602125 0.76117389]\n",
      "f1 score: 1.0\n"
     ]
    }
   ],
   "source": [
    "result1 = eval_model(CLF)\n",
    "for item in result1:\n",
    "    results[1].append(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Store Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../experiments/experiment_0/models/trained/rfc_model.pkl']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# save the model\n",
    "joblib.dump(CLF, config.trained_path + \"rfc_model.pkl\", compress=('bz2', 3))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic Regression\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logisticregression#sklearn.linear_model.LogisticRegression)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression()"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# select the classifier\n",
    "CLF = LogisticRegression()\n",
    "\n",
    "# fit the model to the data\n",
    "CLF.fit(X_train_prepared, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Model Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy:0.9049405878674172\n",
      "Cross validation score: [0.5042311  0.49173554 0.51937511]\n",
      "f1 score: 0.05\n"
     ]
    }
   ],
   "source": [
    "result2 = eval_model(CLF)\n",
    "for item in result2:\n",
    "    results[2].append(item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Store Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['../experiments/experiment_0/models/trained/log_model.pkl']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# save the model\n",
    "joblib.dump(CLF, config.trained_path + \"log_model.pkl\", compress=('bz2', 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Model Selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                             Estimator score method  Scoring parameter                     Metric functions\n",
      "-------------------------  ------------------------  ----------------------------------  ------------------\n",
      "Support Vector Classifier                  0.91182   [0.5 0.5 0.5]                                0.0662252\n",
      "Random Forest Classifier                   1         [0.73339777 0.70602125 0.76117389]           1\n",
      "Logistic Regression                        0.904941  [0.5042311  0.49173554 0.51937511]           0.05\n"
     ]
    }
   ],
   "source": [
    "#display table\n",
    "print(tabulate(results, headers=col_names))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "fp = open(config.trained_path + \"model_comparison.txt\", \"w\")\n",
    "fp.write(tabulate(results, headers=col_names))\n",
    "fp.close()\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Launch, Monitor, and Maintain\n",
    "\n",
    "1. Use joblib to save the train model inclulding full pre-processing and prediction pipeline\n",
    "1. Load the trained model to production\n",
    "1. Call the predict() method to make predictions\n",
    "\n",
    "## Serving\n",
    "1. Load the model in a web app that will call the predict() method\n",
    "1. Wrap the model in a dedicated web service that a web app queries with REST API\n",
    "    1. makes it easy to upgrade without interrupting the primary web app\n",
    "    1. makes it easy to scale web services and load balance the requests from the web app across the web services\n",
    "    1. enables the web app to use any language, not just Python\n",
    "\n",
    "## Monitor\n",
    "1. Write monitor code the check live performance at regular intervals and trigger alerts when it drops\n",
    "    1. Could be steep drop if an infrastructure components stops\n",
    "    1. or, a gentle decay as the world changes resulting in model rot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
